About project

Our project consist of exploratory data analysis (EDA) and predictive analytics on 2016 Global Ecological Footprint dataset. We choose this dataset since ecological footprint is very important measure which can tell us about magnitude of our impact on our planet. Learning from ecological data can be very useful since it can point to variables which are hurting nature the most.

Loading R packages

library(tidyverse)
library(ggplot2)
library(ggthemes)
library(corrplot)
library(psych)
library(leaflet)
library(naniar) 
library(knitr)
library(kableExtra)
library(plotly)
library(cowplot)
library(caret)
library(randomForest)
library(rpart.plot)
library(rpart)
library(rgl)

Know Your Data (and a bit of feature engineering)

dat <- read.csv("data/countries.csv", encoding="UTF-8", stringsAsFactors = F)
glimpse(dat)
## Observations: 188
## Variables: 21
## $ Country                        <chr> "Afghanistan", "Albania", "Alge...
## $ Region                         <chr> "Middle East/Central Asia", "No...
## $ Population..millions.          <dbl> 29.82, 3.16, 38.48, 20.82, 0.09...
## $ HDI                            <dbl> 0.46, 0.73, 0.73, 0.52, 0.78, 0...
## $ GDP.per.Capita                 <chr> "$614.66", "$4,534.37", "$5,430...
## $ Cropland.Footprint             <dbl> 0.30, 0.78, 0.60, 0.33, NA, 0.7...
## $ Grazing.Footprint              <dbl> 0.20, 0.22, 0.16, 0.15, NA, 0.7...
## $ Forest.Footprint               <dbl> 0.08, 0.25, 0.17, 0.12, NA, 0.2...
## $ Carbon.Footprint               <dbl> 0.18, 0.87, 1.14, 0.20, NA, 1.0...
## $ Fish.Footprint                 <dbl> 0.00, 0.02, 0.01, 0.09, NA, 0.1...
## $ Total.Ecological.Footprint     <dbl> 0.79, 2.21, 2.12, 0.93, 5.38, 3...
## $ Cropland                       <dbl> 0.24, 0.55, 0.24, 0.20, NA, 2.6...
## $ Grazing.Land                   <dbl> 0.20, 0.21, 0.27, 1.42, NA, 1.8...
## $ Forest.Land                    <dbl> 0.02, 0.29, 0.03, 0.64, NA, 0.6...
## $ Fishing.Water                  <dbl> 0.00, 0.07, 0.01, 0.26, NA, 1.6...
## $ Urban.Land                     <dbl> 0.04, 0.06, 0.03, 0.04, NA, 0.1...
## $ Total.Biocapacity              <dbl> 0.50, 1.18, 0.59, 2.55, 0.94, 6...
## $ Biocapacity.Deficit.or.Reserve <dbl> -0.30, -1.03, -1.53, 1.61, -4.4...
## $ Earths.Required                <dbl> 0.46, 1.27, 1.22, 0.54, 3.11, 1...
## $ Countries.Required             <dbl> 1.60, 1.87, 3.61, 0.37, 5.70, 0...
## $ Data.Quality                   <chr> "6", "6", "5", "6", "2", "6", "...
dat$GDP.per.Capita <-  as.numeric(gsub('[$,]', '', dat$GDP.per.Capita))

# conversion to factors
dat$Country <- as.factor(dat$Country)
dat$Region <- as.factor(dat$Region)

We can see that 19 of the columns are numeric types. Column GDP.per.Capita had to be converted to double type. Country and Region were converted to factors. We will ommit Data.Quality because we will not use it anywhere in this project.

dat$Data.Quality <- NULL

Checking for missing data

It’s always important to check for missing values and consider how to fix them.

missing_data <- dat %>% summarise_all(funs(sum(is.na(.))/n()))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
## 
## # Before:
## funs(name = f(.)
## 
## # After: 
## list(name = ~f(.))
## This warning is displayed once per session.
missing_data <- gather(missing_data, key = "variables", value = "percent_missing")
missing_data <- missing_data[missing_data$percent_missing > 0.0, ] 
ggplot(missing_data, aes(x = reorder(variables, percent_missing), y = percent_missing)) +
geom_bar(stat = "identity", fill = "red", aes(color = I('white')), size = 0.3, alpha = 0.8)+
xlab('variables')+
coord_flip()+ 
theme_fivethirtyeight() +
  ggtitle("Missing Data By Columns",
          subtitle = "HDI has the highest percentage")

table1_dat <- dat[is.na(dat$HDI), c(1,2)]
rownames(table1_dat) <- NULL
table1_dat %>% kable(caption = "Countries with missing data") %>%  kable_styling("striped", full_width = T) %>% row_spec(c(4,15), bold = T, background = "lightblue")
Countries with missing data
Country Region
Aruba Latin America
Bermuda North America
British Virgin Islands Latin America
Cayman Islands Latin America
Côte d’Ivoire Africa
French Guiana Latin America
French Polynesia Asia-Pacific
Guadeloupe Latin America
Korea, Democratic People’s Republic of Asia-Pacific
Martinique Latin America
Montserrat Latin America
Nauru Asia-Pacific
New Caledonia Asia-Pacific
Réunion Africa
Somalia Africa
Wallis and Futuna Islands Asia-Pacific

They are all pretty much small countries (with exceptions like Côte d’Ivoire, Somalia etc.) .

Data splitting & imputation

##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.608    42.16 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.519    40.71 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.593    41.91 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.662    43.03 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.736    44.21 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.706    43.74 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.644    42.73 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.641    42.69 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |     2.61    42.18 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.582    41.74 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.675    43.24 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.628    42.48 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.664    43.06 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.712    43.84 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.724    44.03 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.655    42.92 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.633    42.56 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.762    44.65 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.765    44.70 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 |    2.735    44.21 |

Visualising all numeric columns

It’s useful to show histograms of all numeric columns.

multi.hist(dat[,sapply(dat, is.numeric)])

Most of the variables (Population..millions, Fishing.Water, etc.) have right skewed distributions. We will inspect Total Ecological Footprint more detaily since it is our target variable.

dat %>% ggplot(aes(x = Total.Ecological.Footprint)) +
  geom_histogram(bins = 30, aes(y = ..density..), fill = "purple") + 
  geom_density(alpha = 0.2, fill = "purple") +
  theme_fivethirtyeight() +
  ggtitle("Total per Capita Footprint") +
  theme(axis.title = element_text(), axis.title.x = element_text()) +
  geom_vline(xintercept = mean(dat$Total.Ecological.Footprint), size = 2, linetype = 3) +
  annotate("text", x = 7, y = 0.35, label = "Average per Capita Footprint is 3.32")

Biggest polluters

dat %>% arrange(desc(Total.Ecological.Footprint)) %>% select(Total.Ecological.Footprint, Country) %>% head(n = 15) %>% dplyr::rename(Footprint_per_Person = Total.Ecological.Footprint) %>%
kable(caption = "Biggest polluters - by countries", col.names = c("Footprint per Person",
                           "Country")) %>%
    kable_styling("striped", full_width = F) %>%
  row_spec(1:5, bold = T, color = "white", background = "#D7261E")
Biggest polluters - by countries
Footprint per Person Country
15.82 Luxembourg
11.88 Aruba
10.80 Qatar
9.31 Australia
8.22 United States of America
8.17 Canada
8.13 Kuwait
7.97 Singapore
7.93 United Arab Emirates
7.92 Trinidad and Tobago
7.78 Montserrat
7.52 Oman
7.49 Bahrain
7.44 Belgium
7.25 Sweden

Smallest polluters

dat %>% arrange((Total.Ecological.Footprint)) %>% select(Total.Ecological.Footprint, Country) %>% head(n = 15) %>% dplyr::rename(Footprint_per_Person = Total.Ecological.Footprint) %>%
kable(caption = "Smallest polluters - by countries", col.names = c("Footprint per Person",
                           "Country")) %>%
    kable_styling("striped", full_width = F) %>%
  row_spec(1:5, bold = T, color = "white", background = "#56AB6F")
Smallest polluters - by countries
Footprint per Person Country
0.42 Eritrea
0.48 Timor-Leste
0.61 Haiti
0.72 Bangladesh
0.79 Afghanistan
0.79 Pakistan
0.80 Burundi
0.81 Malawi
0.82 Congo, Democratic Republic of
0.87 Mozambique
0.87 Rwanda
0.91 Tajikistan
0.93 Angola
0.98 Nepal
0.99 Madagascar

Visualising correlation

k <- dat[, sapply(dat, is.numeric)]
k <- k[complete.cases(k), ]
korelacija <- cor(k)
corrplot(korelacija, method = "color", tl.cex = 0.825, title = "Pearson correlation", mar=c(0,0,1,0))

k2 <- dat[, sapply(dat, is.numeric)]
k2 <- k2[complete.cases(k2), ]
korelacija2 <- cor(k2, method = "spearman")
corrplot(korelacija2, method = "color", tl.cex = 0.825, title = "Spearman correlation", mar = c(0,0,1,0))

We can see that the results are different, so we can conclude that the linear relationship is probably not the best guess.

Exploring by Regions

dat %>% group_by(Region) %>% tally() %>% 
  ggplot(aes(x = reorder(Region, n), n)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme_fivethirtyeight() +
  ggtitle("Number of countries by regions") +
  geom_text(aes(x = Region, y = 1, label = paste0(n)),
            hjust=0.15, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  coord_flip()

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Linear regression

We will use linear regression to explain per capit footprint by serveral predictors. Both economical and geographical features will be used.

Economical Features

linear_model1 <- lm(Total.Ecological.Footprint ~ HDI, data = dat)
summary(linear_model1)
## 
## Call:
## lm(formula = Total.Ecological.Footprint ~ HDI, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6459 -0.9784 -0.3301  0.6633 10.3107 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.3021     0.5423  -7.933 2.74e-13 ***
## HDI          11.0241     0.7706  14.306  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.572 on 170 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.5462, Adjusted R-squared:  0.5436 
## F-statistic: 204.7 on 1 and 170 DF,  p-value: < 2.2e-16
linear_model2 <- lm(Total.Ecological.Footprint ~ I(exp(HDI)), data = dat)
summary(linear_model2)
## 
## Call:
## lm(formula = Total.Ecological.Footprint ~ I(exp(HDI)), data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5625 -0.9288 -0.4152  0.5897 10.0971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.3649     0.7785  -10.74   <2e-16 ***
## I(exp(HDI))   5.7853     0.3829   15.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 170 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.5731, Adjusted R-squared:  0.5706 
## F-statistic: 228.3 on 1 and 170 DF,  p-value: < 2.2e-16
linear_model3 <- lm(Total.Ecological.Footprint ~ HDI + I(HDI**2), data = dat)
summary(linear_model3)
## 
## Call:
## lm(formula = Total.Ecological.Footprint ~ HDI + I(HDI^2), data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5236 -0.8254 -0.2381  0.3253  9.5480 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.502      1.985   3.275 0.001280 ** 
## HDI          -23.827      6.238  -3.820 0.000188 ***
## I(HDI^2)      26.481      4.709   5.623 7.61e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.447 on 169 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.6178, Adjusted R-squared:  0.6132 
## F-statistic: 136.6 on 2 and 169 DF,  p-value: < 2.2e-16
slr <- ggplot(dat, aes(HDI, Total.Ecological.Footprint)) +
  geom_point(aes(text = Country)) +
  geom_smooth(method= "lm", color = "red", linetype = 1, se=F) +
  geom_smooth(method= "lm", formula = (y ~ x + I(x**2)), color = "blue", linetype = 2, se=F) +
  ggtitle("Simple Linear Regression",
          subtitle = "Model With Quadratic Term does much better") 
## Warning: Ignoring unknown aesthetics: text
ggplotly(slr, tooltip = "text")
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
slr <- ggplot(dat, aes(GDP.per.Capita, Total.Ecological.Footprint)) +
  geom_point(aes(text = Country)) +
  geom_smooth(method= "lm", color = "red", linetype = 1, se=F) +
  geom_smooth(method= "lm", formula = (y ~ x + I(x**2)), color = "blue", linetype = 2, se=F) +
  ggtitle("Simple Linear Regression",
          subtitle = "Model With Quadratic Term does much better") 
## Warning: Ignoring unknown aesthetics: text
ggplotly(slr, tooltip="text")
## Warning: Removed 15 rows containing non-finite values (stat_smooth).
## Warning: Removed 15 rows containing non-finite values (stat_smooth).
ggplot(dat, aes(x = HDI, y = GDP.per.Capita)) +
  geom_point() +
  theme_fivethirtyeight() +
  ggtitle("HDI vs. GDP per Capita") +
   theme(axis.title = element_text(), axis.title.x = element_text())
## Warning: Removed 17 rows containing missing values (geom_point).

It should be obvious that HDI and GDP are not linearly correlated. Exponential function would be much more suitable.

multiple1 <- lm(Total.Ecological.Footprint ~ GDP.per.Capita + HDI + I(GDP.per.Capita**2) + I(HDI**2) , data = dat)
summary(multiple1)
## 
## Call:
## lm(formula = Total.Ecological.Footprint ~ GDP.per.Capita + HDI + 
##     I(GDP.per.Capita^2) + I(HDI^2), data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8412 -0.7107 -0.1608  0.5324  5.3628 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         -2.853e+00  2.263e+00  -1.261  0.20920   
## GDP.per.Capita       8.304e-05  2.503e-05   3.318  0.00111 **
## HDI                  1.100e+01  7.758e+00   1.419  0.15791   
## I(GDP.per.Capita^2) -1.535e-10  2.077e-10  -0.739  0.46083   
## I(HDI^2)            -5.035e+00  6.603e+00  -0.762  0.44690   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.259 on 166 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.715,  Adjusted R-squared:  0.7081 
## F-statistic: 104.1 on 4 and 166 DF,  p-value: < 2.2e-16
ggplot(dat, aes(x = Population..millions., y = Total.Ecological.Footprint)) +
  geom_point() +
  theme_fivethirtyeight() +
  theme(axis.title = element_text(), axis.title.x = element_text()) +
  ggtitle("Using population as predictor",
          subtitle = "not really useful as predictor") +
  annotate("text", x = 800, y = 10,label = "Pearson correlation is -0.06", color = "red", size = 9)

Geographical Features

crops <- ggplot(dat, aes(x = Cropland, y = Total.Ecological.Footprint)) +
  geom_point(aes(size = Total.Ecological.Footprint), alpha = 0.4, color = "#e6a526") +  theme_fivethirtyeight() +
  ggtitle("Crop Land") +
   theme(axis.title = element_text(), axis.title.x = element_text(), legend.position="none")


urban <- ggplot(dat, aes(x = Urban.Land, y = Total.Ecological.Footprint)) +
  geom_point(aes(size = Total.Ecological.Footprint), alpha = 0.4, color = "#FF7F50")+
  theme_fivethirtyeight() +
  ggtitle("Urban Land") +
   theme(axis.title = element_text(), axis.title.x = element_text(), legend.position="none")

forest <- ggplot(dat, aes(x = Forest.Land, y = Total.Ecological.Footprint)) +
  geom_point(aes(size = Total.Ecological.Footprint), alpha = 0.4, color = "#56AB6F")+
  theme_fivethirtyeight() +
  ggtitle("Forest Land") +
   theme(axis.title = element_text(), axis.title.x = element_text(), legend.position="none")

fishing <- ggplot(dat, aes(x = Fishing.Water, y = Total.Ecological.Footprint)) +
  geom_point(aes(size = Total.Ecological.Footprint), alpha = 0.4, color = "#504FB1")+
  theme_fivethirtyeight() +
  ggtitle("Fishing Water") +
   theme(axis.title = element_text(), axis.title.x = element_text(), legend.position="none")


cowplot::plot_grid(urban, forest, fishing, crops, ncol = 2, nrow = 2,
                   label_y = "Total Footprint")
## Warning: Removed 15 rows containing missing values (geom_point).

## Warning: Removed 15 rows containing missing values (geom_point).

## Warning: Removed 15 rows containing missing values (geom_point).

## Warning: Removed 15 rows containing missing values (geom_point).


It can be seen increasing trends in all variables, except for Forest Land.

Evaluating linear model

HDI <- dat$HDI
GDP <- dat$GDP.per.Capita <- as.numeric(gsub('[$,]', '', dat$GDP.per.Capita))

final_multiple <- lm(Total.Ecological.Footprint ~ GDP.per.Capita + HDI + I(GDP.per.Capita**2) + I(HDI**2) , data = dat_train_imputed)
predicted <- predict(multiple1, newdata = dat_test)


knitr::include_graphics(c("visual/cube1.png", "visual/cube2.png"))

RMSE_mlr <- sqrt(mean((predicted[complete.cases(predicted)]- dat_test$Total.Ecological.Footprint[complete.cases(predicted)])**2))

Regression Tree

Regression trees are simple regression tehnique which can be easily interpreted and visualised. Disadvantages are that they are instable and sensitive to very small change in data. It is crucial to avoid overfitting when using regression trees.

d_tree <- rpart(Total.Ecological.Footprint ~ Population..millions. + HDI + GDP.per.Capita + Cropland + Grazing.Land + Forest.Land + Fishing.Water + Urban.Land + Total.Biocapacity, data = dat_train_imputed)

rpart.plot(d_tree, main="Regression Tree", fallen.leaves=T, box.palette="GnBu")

predictions_d_tree <- predict(d_tree, newdata = dat_test)
RMSE_regression_tree <- sqrt(mean((predictions_d_tree[complete.cases(dat_test)] - dat_test$Total.Ecological.Footprint[complete.cases(dat_test)])**2))

Random Forest Regression

We will use 10-fold cross validation to select optimal values for hyperparameters: mtry (randomly selected variables) and ntree (total number of trees in random forest)

cv.10.folds <- createMultiFolds(dat_train_imputed, k = 10, times = 5) # cross-validation 10 folds
ctrl.1 <- trainControl(method = "repeatedcv", number = 10, repeats = 3, index = cv.10.folds)

rf_model <- train(Total.Ecological.Footprint ~ Population..millions. + HDI + GDP.per.Capita + Cropland + Grazing.Land + Forest.Land + Fishing.Water + Urban.Land + Total.Biocapacity, data = dat_train_imputed, method = "rf", trControl = ctrl.1, tuneLength = 5, importance = T)
importance <- varImp(rf_model)
finaldf <- importance$importance
variable_names <- rownames(finaldf)
finaldf$Variables <- variable_names

We can visualize importance of used predictors.

ggplot(finaldf, aes(x = reorder(Variables, Overall), Overall)) +
geom_point(stat = "identity", size = 4, pch = 21, fill = "blue") +
geom_bar(stat = "identity", width = 0.075) +
xlab("Variables Names") +
coord_flip() +
theme_fivethirtyeight() +
ggtitle("Variables Ordered By Importance")

Predicting values

predictions_rf <- predict(rf_model, newdata = dat_test[complete.cases(dat_test), ])


y_real <- dat_test[complete.cases(dat_test), "Total.Ecological.Footprint"]

RMSE_rf <- sqrt(mean((predictions_rf - y_real)**2))

results <- tibble(
  x = y_real,
  y = predictions_rf,
  error = abs(y_real - predictions_rf)
)

ggplot(results, aes(x, y)) +
  geom_point(size = 3, shape = 1)  +
  geom_abline(slope = 1, intercept = 0, color = "blue", linetype = 2) +
  theme_fivethirtyeight() +
  ggtitle("Observed vs. Predicted") + 
  annotate("text", x = 5, y = 1.5, label = "RMSE = 1.17", color = "purple", size = 10)

We can see that random forest algorithm got better results than regression tree. That’s not big of a surprise because random forests are well known for robustness and good scores.

Summary of models

models <- tibble(
  model_name = character(),
  RMSE = numeric()
)
models <- add_row(models, model_name = "Multiple linear regression", RMSE = RMSE_mlr)
models <- add_row(models, model_name = "Regression tree", RMSE = RMSE_regression_tree)
models <- add_row(models, model_name = "Random forests", RMSE = RMSE_rf)

models %>% kable(caption = "RMSE of different models", col.names = c("Model name", "RMSE")) %>%  kable_styling("striped", full_width = T)
RMSE of different models
Model name RMSE
Multiple linear regression 0.8865465
Regression tree 1.1031298
Random forests 0.8785311

The End!